Problem Statement and Hypothesis:

SF rents have been steadily rising since 2000 to 2010 when 28,500 newcomers arrived to the city. From 2010-2012, 20,600 additional people moved in, in just three years. People have attributed the growth of population to the blazing technology sector.

While the demand for housing increased, supply was relatiively fixed due to zoning laws. The most fundamental rule of economics dictates that prices rise when there is a high demand but low supply. I was interested in seeing how crime affects prices in certain neighbourhoods and if there was a relation between price and crime.

To complete this goal I analyzed trends of reported crime and rent hikes in different neighborhoods and reported trends with exploratory graphs. Predict prices of housing in San Francisco using districts, years, bedrooms, and aggregate crime counts per district per year. Use Regression, CART, Random Forests to predict rental prices based on variables such as categorized crime counts, bedrooms, and year.

In [1]:
import pandas as pd
import numpy as np
import shapely

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Description of Data:

The project involves an in depth analysis of the change of rental and crime rates in San Francisco from 2003-2015. Data was retrieved from a multitude of sources but mainly Eric Fischers github repo of aggregated Craigslist rental data archived from wayback machine and the SF crime data set from opendata.

Original data from craigslist appears as below, and I needed to parse and extract attributes specifically price, bedrooms, and district. This task was performed through a variety of data structure methods and regular expressions. The issue with craigslist data is that posts are inconsistent in structure and name of districts, to counter this problem I used scoring algorithms vs a dictionary of San Francisco districts to standardize my data.

alttext

Variable Engineering:

Variables were obtained from the craigslist data and locations of districts were scraped from Google. The rest of the variables were present in the crime dataset. Engineered a few binary features based on whether a crime was violent, alcohol related, organized, narcotic, or city related.

In [3]:
#crime = pd.read_csv('./train.csv')    # read in crime
crime = pd.read_csv('./SFPD_Incidents_-_from_1_January_2003.csv')
crime_sub = crime.sample(100000)   #subset sample
crime_sub.head(1)
Out[3]:
IncidntNum Category Descript DayOfWeek Date Time PdDistrict Resolution Address X Y Location PdId
1399038 101001884 WARRANTS WARRANT ARREST Thursday 10/28/2010 11:02 PARK ARREST, BOOKED SCOTT ST / LLOYD ST -122.435054 37.769897 (37.7698968941871, -122.435054154915) 10100188463010
In [2]:
district_coordinates = pd.read_csv('./district_coordinates.csv')   # read in coordinates
del district_coordinates['Unnamed: 0']
district_coordinates = district_coordinates.dropna()
district_coordinates.head()  # 4 coordinate places is an accuracy of 11 meters
Out[2]:
district lat lon
0 bernal heights 37.7412 -122.4178
1 castro 37.7609 -122.4350
2 chinatown 37.7941 -122.4078
4 excelsior 37.7244 -122.4272
5 financial district 37.7946 -122.3999
In [131]:
crime_sub.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 97203 entries, 2010-10-28 to 2014-02-03
Data columns (total 23 columns):
IncidntNum        97203 non-null int64
Category          97203 non-null object
Descript          97203 non-null object
DayOfWeek         97203 non-null object
Time              97203 non-null object
PdDistrict        97203 non-null object
Resolution        97203 non-null object
Address           97203 non-null object
X                 97203 non-null float64
Y                 97203 non-null float64
Location          97203 non-null object
PdId              97203 non-null int64
X1                97203 non-null float64
Y1                97203 non-null float64
match_district    97203 non-null object
is_theft          97203 non-null int64
is_drunk          97203 non-null int64
is_organized      97203 non-null int64
is_violent        97203 non-null int64
is_narcotic       97203 non-null int64
is_city           97203 non-null int64
year              97203 non-null int32
Date              97203 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int32(1), int64(8), object(9)
memory usage: 17.4+ MB
In [154]:
crime_sub.head()
Out[154]:
IncidntNum Category Descript DayOfWeek Time PdDistrict Resolution Address X Y ... Y1 match_district is_theft is_drunk is_organized is_violent is_narcotic is_city year Date
Date
2010-10-28 101001884 WARRANTS WARRANT ARREST Thursday 11:02 PARK ARREST, BOOKED SCOTT ST / LLOYD ST -122.435054 37.769897 ... 37.769897 lower haight 0 0 0 0 0 0 2010 2010-10-28
2006-01-05 60018714 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO Thursday 18:10 BAYVIEW NONE DEHARO ST / MARIPOSA ST -122.401481 37.763562 ... 37.763562 potrero hill 1 0 0 0 0 0 2006 2006-01-05
2012-07-01 120517632 FRAUD FRAUDULENT USE OF AUTOMATED TELLER CARD Sunday 10:00 PARK NONE HAYES ST / PIERCE ST -122.434485 37.775417 ... 37.775417 western addition 0 0 0 0 0 0 2012 2012-07-01
2003-07-20 30862648 ASSAULT INFLICT INJURY ON COHABITEE Sunday 17:25 NORTHERN ARREST, BOOKED POLK ST / UNION ST -122.422340 37.798761 ... 37.798761 russian hill 0 0 0 1 0 0 2003 2003-07-20
2012-09-21 126152252 NON-CRIMINAL LOST PROPERTY Friday 18:00 SOUTHERN NONE 0 Block of THE EMBARCADEROSOUTH ST -122.388380 37.783310 ... 37.783310 south beach 0 0 0 0 0 0 2012 2012-09-21

5 rows × 23 columns

In [155]:
posts_crimes.head()
Out[155]:
D1_ARSON D1_ASSAULT D1_BAD CHECKS D1_BRIBERY D1_BURGLARY D1_DISORDERLY CONDUCT D1_DRIVING UNDER THE INFLUENCE D1_DRUG/NARCOTIC D1_DRUNKENNESS D1_EMBEZZLEMENT ... bed district1 district2 district3 post price year district1_code district2_code district3_code
0 0 9 0 0 4 1 0 1 0 1 ... 1 sunset parkside inner sunset 950 1br sunny one bedroom for rent sunset p... 950.0 2003 45 36 14
1 0 7 0 0 5 0 1 1 2 0 ... 2 marina nob hill nob hill 1500 2br 1 ba cow hollow open house thursday... 1500.0 2003 21 27 27
2 1 37 1 0 10 4 1 34 1 2 ... 2 hayes valley noe valley visitacion valley 1795 2br spacious completely remodeled 2 bd... 1795.0 2003 11 28 51
3 1 16 0 0 12 2 0 3 1 0 ... 0 pacific heights presidio heights inner richmond 1180 0br victorian hotel room wprivate bath ... 1180.0 2003 33 40 13
4 1 12 0 0 6 0 0 4 1 0 ... 1 inner richmond outer richmond inner sunset 1100 1br great landlordgreat tenats inner ri... 1100.0 2003 13 32 14

5 rows × 124 columns

Exploratory Data Aanlysis:

Below are a series of several plots that annotate crime per year for all of sf, median price per year for all of sf, crime/price for districts I thought would represent the most possible neighborhoods and population of San Francisco. Finally there is a series of plots of crimes of interest from the engineered dummy variables based on whether a crime was violent, alcohol related, organized, narcotic, or city related.

alt text

alt text

alt text

alt text

alt text

alt text

alt text

alt text

alt text

alt text

In [121]:
sns.set(rc={"figure.figsize": (10, 10)})
sns.set_style('white')


print crime_sub.Category.value_counts()[0:10]
crime_sub.Category.value_counts().plot(kind = 'barh')
LARCENY/THEFT     19684
OTHER OFFENSES    13846
NON-CRIMINAL      10166
ASSAULT            8310
VEHICLE THEFT      5946
DRUG/NARCOTIC      5940
VANDALISM          4935
WARRANTS           4694
BURGLARY           4063
SUSPICIOUS OCC     3404
Name: Category, dtype: int64
Out[121]:
<matplotlib.axes._subplots.AxesSubplot at 0x125d5a910>
In [122]:
sns.set(rc={"figure.figsize": (10, 10)})
sns.set_style('white')

price_by_hood_pre = posts_crimes[posts_crimes.bed == 1]
price_by_hood = pd.DataFrame(price_by_hood_pre.groupby(['district1','year']).median()['price']).reset_index()

price_by_hood[price_by_hood.year == (2003)].groupby(['district1','year']).median().sort_values('price').plot(kind = 'barh')

price_by_hood[price_by_hood.year == (2015)].groupby(['district1','year']).median().sort_values('price').plot(kind = 'barh')
Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x125402410>

Function to map crime location coordinates to districts

The crime dataset had different districts of SanFrancisco (Police Districts) than the craigslist data districts. This created a problem since there were no common columns to perform a merge on. To solve this issue I scraped coordinates for each craigslist district center and wrote a function that calculates the closest distance to the crime location coordinates. The function uses the haversine distance as a scoring metric creating a list of scores from each district to the crime location, the distance that is the lowest is assumbed to be the district that the crime is in.

In [4]:
# get decimals as a feature X1 X2 from the location coordinates
from decimal import Decimal
def replace_lon(location):
    
    lon = str(location).replace('(','').replace(')','').split(',')
    lon = np.float64(lon[1])
    return lon
    
def replace_lat(location):
    lat = str(location).replace('(','').replace(')','').split(',')
    lat = np.float64(lat[0])
    return lat
    
crime_sub['X1'] = crime_sub.apply(lambda row: replace_lon(row['Location']),axis = 1)
crime_sub['Y1'] = crime_sub.apply(lambda row: replace_lat(row['Location']),axis = 1)

# haversine distances to check which crime occured in what district
%timeit
from haversine import haversine
def check_distance(location):
    #print location
    haversine_list = []
    for d,lat,lon in zip(district_coordinates.district,district_coordinates.lat, district_coordinates.lon):

        haversine_list.append((haversine(location, (lon,lat)),d))
    
    match = sorted(haversine_list)[0][1]
    return match
    
crime_sub['match_district'] = crime_sub.apply(lambda row: check_distance((row['X1'],row['Y1'])), axis = 1 )

Create sub crime categories based on crimes that could be considered under the same category.

In [5]:
# create crime Categories
crime_sub.Category.value_counts()

theft = ['LARCENY/THEFT','VEHICLE THEFT','BURGLARY','ROBBERY','STOLEN PROPERTY','RECOVERED VEHICLE']
drunk = ['DRUG/NARCOTIC','DRUNKENNESS','LIQOUR LAWS','DRIVING UNDER THE INFLUENCE','LIQUOR LAWS']
organized = ['BRIBERY','EXTORTION','PROSTITUTION']
violent = ['ASSAULT','SEX OFFENSES, FORCIBLE','KIDNAPPING']
narcotic = ['DRUG/NARCOTIC']
city = ['VANDALISM','LOITERING']

crime_sub['is_theft'] = crime_sub.Category.apply(lambda x: 1 if x in theft else 0)
crime_sub['is_drunk'] = crime_sub.Category.apply(lambda x: 1 if x in drunk else 0)
crime_sub['is_organized'] = crime_sub.Category.apply(lambda x: 1 if x in organized else 0)
crime_sub['is_violent'] = crime_sub.Category.apply(lambda x: 1 if x in violent else 0)
crime_sub['is_narcotic'] = crime_sub.Category.apply(lambda x: 1 if x in narcotic else 0)
crime_sub['is_city'] = crime_sub.Category.apply(lambda x: 1 if x in city else 0)

Convert indexes and dates to datetime:

Use pandas date time to convert date column into the index, then use that information to create a yearly feature for the crimes.

In [6]:
from datetime import datetime as dt
print crime_sub.index.dtype
crime_sub['Date'] = pd.to_datetime(crime_sub['Date'])
crime_sub.set_index('Date',inplace= True)
print crime_sub.index.dtype

# create a year function, 2016 is not completely represented
crime_sub['year'] = crime_sub.index.map(lambda val: val.year)
crime_sub.head(1)
crime_sub = crime_sub[crime_sub['year'] < 2016]
int64
datetime64[ns]
In [7]:
posts = pd.read_csv('./craigslist_posts.csv')
posts.head()
posts = posts[posts.bed < 6]

Transformations:

Use a series of transformations to merge the data from 2 separate data frames of craigslists posts and crimes to aggregate counts of crime categories per district per year and the rental listing information.

In [8]:
crime_groups = pd.DataFrame(crime_sub.groupby(['year','Category','match_district']).size())

crime_wide = crime_groups.reset_index()
crime_wide.head()

crime_wide.columns = ['year','crime','district','crime_count']
crime_wide.head()

print [x for x in crime_wide.district.values if not type(x) == str]

crime_wide = crime_wide[[True if type(x) == str else False for x in crime_wide.district.values]]

crime_wide.district = crime_wide.district.map(lambda x: x.lower())
crime_wide_ = pd.pivot_table(crime_wide, index=['year','crime'], columns=['district'], fill_value=0)['crime_count'].reset_index()
crime_wide_.head()

crime_trans = crime_wide.T

posts = posts.iloc[:,1:]
posts.head()

crime_trans.columns = np.ravel(crime_trans.ix['crime',:].values)
crime_trans.head()

small_posts = posts

small_posts.shape

small_posts.year.value_counts()
[]
Out[8]:
2005    15455
2004     6433
2006     4529
2016     2494
2012     2283
2014     2218
2015     2030
2011     1476
2007      884
2013      772
2003      497
2008      487
2009      195
2010       99
Name: year, dtype: int64
In [ ]:
 
In [ ]:
# Transformations to aggregate counts

new_dfs = []

for i, row in small_posts.iterrows():
    
    ndf = pd.DataFrame({
            'post':[row.post],
            'year':[row.year],
            'price':[row.price],
            'bed':[row.bed],
            'district1':[row.district1],
            'district2':[row.district2],
            'district3':[row.district3]
        })
    ndf.index = [i]
    
    d1_mask_colmask = (crime_trans.loc['year',:] == row.year) & (crime_trans.loc['district',:] == row.district1)
    d2_mask_colmask = (crime_trans.loc['year',:] == row.year) & (crime_trans.loc['district',:] == row.district2)
    d3_mask_colmask = (crime_trans.loc['year',:] == row.year) & (crime_trans.loc['district',:] == row.district3)
    
    district1_crimes = crime_trans.ix[['crime_count'], d1_mask_colmask]
    district2_crimes = crime_trans.ix[['crime_count'], d2_mask_colmask]
    district3_crimes = crime_trans.ix[['crime_count'], d3_mask_colmask]
    
    district1_crimes.columns = ['D1_'+x for x in district1_crimes.columns]
    district2_crimes.columns = ['D2_'+x for x in district2_crimes.columns]
    district3_crimes.columns = ['D3_'+x for x in district3_crimes.columns]
    
    dist_crimes = pd.concat([district1_crimes, district2_crimes, district3_crimes],
                            axis=1, ignore_index=False)
    
    
    dist_crimes.index = [i]
    
    ndf = pd.concat([ndf, dist_crimes], axis=1, ignore_index=False)
    
    new_dfs.append(ndf)
    
posts_crimes = pd.concat(new_dfs, axis=0, ignore_index=False).fillna(value=0, axis=1)
    
    
    
    
In [ ]:
posts_crimes.head()
In [ ]:
from sklearn.preprocessing import LabelEncoder

posts_crimes.district1 = posts_crimes.district1.map(lambda x: 'none' if x == 0 else x)
posts_crimes.district2 = posts_crimes.district2.map(lambda x: 'none' if x == 0 else x)
posts_crimes.district3 = posts_crimes.district3.map(lambda x: 'none' if x == 0 else x)

label_encoder = LabelEncoder()

dists = posts_crimes.district1.values.tolist() + posts_crimes.district2.values.tolist() + posts_crimes.district3.values.tolist()
dists = np.unique(dists)
# print dists
# print posts_crimes.district1.unique()
# print posts_crimes.district2.unique()
# print posts_crimes.district3.unique()


label_encoder.fit(dists)

posts_crimes['district1_code'] = label_encoder.transform(posts_crimes.district1.values)
posts_crimes['district2_code'] = label_encoder.transform(posts_crimes.district2.values)
posts_crimes['district3_code'] = label_encoder.transform(posts_crimes.district3.values)
In [ ]:
posts_crimes.to_csv('./aggregate_df.csv')
In [132]:
posts_crimes = pd.read_csv('./aggregate_df.csv')
del posts_crimes['Unnamed: 0']
In [133]:
posts_crimes.head(1)
Out[133]:
D1_ARSON D1_ASSAULT D1_BAD CHECKS D1_BRIBERY D1_BURGLARY D1_DISORDERLY CONDUCT D1_DRIVING UNDER THE INFLUENCE D1_DRUG/NARCOTIC D1_DRUNKENNESS D1_EMBEZZLEMENT ... bed district1 district2 district3 post price year district1_code district2_code district3_code
0 0 9 0 0 4 1 0 1 0 1 ... 1 sunset parkside inner sunset 950 1br sunny one bedroom for rent sunset p... 950.0 2003 45 36 14

1 rows × 124 columns

In [134]:
posts_crimes = posts_crimes[posts_crimes.price < 50000]
In [135]:
target = posts_crimes.price
tmp = posts_crimes[[col for col in posts_crimes if posts_crimes[col].dtype != object or col == 'district1']]

print posts_crimes.post.dtype
object
In [136]:
tmp.head()
Out[136]:
D1_ARSON D1_ASSAULT D1_BAD CHECKS D1_BRIBERY D1_BURGLARY D1_DISORDERLY CONDUCT D1_DRIVING UNDER THE INFLUENCE D1_DRUG/NARCOTIC D1_DRUNKENNESS D1_EMBEZZLEMENT ... D3_VEHICLE THEFT D3_WARRANTS D3_WEAPON LAWS bed district1 price year district1_code district2_code district3_code
0 0 9 0 0 4 1 0 1 0 1 ... 17 0 1 1 sunset 950.0 2003 45 36 14
1 0 7 0 0 5 0 1 1 2 0 ... 0 0 0 2 marina 1500.0 2003 21 27 27
2 1 37 1 0 10 4 1 34 1 2 ... 57 22 6 2 hayes valley 1795.0 2003 11 28 51
3 1 16 0 0 12 2 0 3 1 0 ... 24 7 0 0 pacific heights 1180.0 2003 33 40 13
4 1 12 0 0 6 0 0 4 1 0 ... 17 0 1 1 inner richmond 1100.0 2003 13 32 14

5 rows × 121 columns

In [137]:
bigml = [c for c in tmp if 'D1' in c ]+['bed','year','district1_code','district1','price']
bigml = tmp[bigml]
bigml.to_csv('bigml.csv')
In [138]:
#pd.read_csv('./bigml.csv')
In [139]:
D1 = [c for c in tmp if 'D1' in c ]+['bed','year','district1_code']
tmp[D1].head()
Out[139]:
D1_ARSON D1_ASSAULT D1_BAD CHECKS D1_BRIBERY D1_BURGLARY D1_DISORDERLY CONDUCT D1_DRIVING UNDER THE INFLUENCE D1_DRUG/NARCOTIC D1_DRUNKENNESS D1_EMBEZZLEMENT ... D1_SUICIDE D1_SUSPICIOUS OCC D1_TRESPASS D1_VANDALISM D1_VEHICLE THEFT D1_WARRANTS D1_WEAPON LAWS bed year district1_code
0 0 9 0 0 4 1 0 1 0 1 ... 0 4 0 13 7 1 1 1 2003 45
1 0 7 0 0 5 0 1 1 2 0 ... 0 1 0 7 13 1 1 2 2003 21
2 1 37 1 0 10 4 1 34 1 2 ... 0 9 8 21 37 30 3 2 2003 11
3 1 16 0 0 12 2 0 3 1 0 ... 0 4 2 7 18 0 0 0 2003 33
4 1 12 0 0 6 0 0 4 1 0 ... 1 9 3 5 24 7 0 1 2003 13

5 rows × 41 columns

In [140]:
district_codes_df = tmp[D1]['district1_code']
In [141]:
district_codes_df.head()
Out[141]:
0    45
1    21
2    11
3    33
4    13
Name: district1_code, dtype: int64

Model Selection:

The goal of this analysis was to use the types of crime 40 categories, # of bedrooms, years and the district code as a numerical variable to predict the price of rent. To accomplish this I first used linear-regression and performed an r^2 value of .43. As a result of having so many different features 40 crime categories, year price I felt that using a decision tree with cross validation of train test split and sample splits of 2,5,10 would perform better. I used grid search to find the best DecisionTreeRegressor (.57) and a depth of 9. Next I used feature importances to select for the highest performers and ran the best model decision tree on only the seected features and it performed at .60 with the most important features being bed year district and larceny. Next I ran random forests and the score improved to .62 with the selected features.

In [149]:
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split

X = tmp[D1]
y = target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

lm = LinearRegression()

model = lm.fit(X_train, y_train)

y_pred = lm.predict(X_test)
print model.score(X_test, y_test)   # coefficient of r2
print model.score(X_test,y_pred)
0.455895185182
1.0
In [150]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(tmp[D1], target, test_size=0.33)

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)


param_grid = {'max_depth': np.arange(3, 10),'min_samples_split':[2,5,10]}

tree = GridSearchCV(DecisionTreeRegressor(criterion='mse'), param_grid, cv = 10,n_jobs = -1)

tree.fit(X_train, y_train)

tree_preds = tree.predict(X_test)
print X_test.shape,y_test.shape


tree_performance = tree.score(X_test, y_test)

print tree_preds
print tree_performance
print tree.best_params_
print tree.grid_scores_
print tree.best_estimator_
print tree.best_score_
(13143, 41) (13143,)
[ 1603.94230769  1274.03778338  2067.10958904 ...,  2013.09161793
  1806.70878274  2189.43556701]
0.573459294574
{'min_samples_split': 2, 'max_depth': 9}
[mean: 0.39309, std: 0.02225, params: {'min_samples_split': 2, 'max_depth': 3}, mean: 0.39309, std: 0.02225, params: {'min_samples_split': 5, 'max_depth': 3}, mean: 0.39309, std: 0.02225, params: {'min_samples_split': 10, 'max_depth': 3}, mean: 0.42304, std: 0.02635, params: {'min_samples_split': 2, 'max_depth': 4}, mean: 0.42304, std: 0.02635, params: {'min_samples_split': 5, 'max_depth': 4}, mean: 0.42304, std: 0.02635, params: {'min_samples_split': 10, 'max_depth': 4}, mean: 0.47444, std: 0.03590, params: {'min_samples_split': 2, 'max_depth': 5}, mean: 0.47444, std: 0.03590, params: {'min_samples_split': 5, 'max_depth': 5}, mean: 0.47444, std: 0.03590, params: {'min_samples_split': 10, 'max_depth': 5}, mean: 0.51637, std: 0.04423, params: {'min_samples_split': 2, 'max_depth': 6}, mean: 0.51587, std: 0.04329, params: {'min_samples_split': 5, 'max_depth': 6}, mean: 0.51607, std: 0.04287, params: {'min_samples_split': 10, 'max_depth': 6}, mean: 0.54331, std: 0.04860, params: {'min_samples_split': 2, 'max_depth': 7}, mean: 0.54271, std: 0.04784, params: {'min_samples_split': 5, 'max_depth': 7}, mean: 0.54318, std: 0.04759, params: {'min_samples_split': 10, 'max_depth': 7}, mean: 0.56261, std: 0.05004, params: {'min_samples_split': 2, 'max_depth': 8}, mean: 0.55813, std: 0.05339, params: {'min_samples_split': 5, 'max_depth': 8}, mean: 0.55765, std: 0.04973, params: {'min_samples_split': 10, 'max_depth': 8}, mean: 0.57600, std: 0.05552, params: {'min_samples_split': 2, 'max_depth': 9}, mean: 0.56659, std: 0.05633, params: {'min_samples_split': 5, 'max_depth': 9}, mean: 0.57130, std: 0.05213, params: {'min_samples_split': 10, 'max_depth': 9}]
DecisionTreeRegressor(criterion='mse', max_depth=9, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
0.576001303094
In [151]:
# best fit tree regressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(tmp[D1], target, test_size=0.33)

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

tree = DecisionTreeRegressor(criterion='mse', max_depth = 9, min_samples_split = 5)

tree.fit(X_train, y_train)

tree_preds = tree.predict(X_test)
print X_test.shape,y_test.shape


tree_performance = tree.score(X_test, y_test)

print tree_preds
print 'performance r2:',tree_performance
print tree.feature_importances_

features_df = pd.DataFrame([col for col in tmp[D1]],[i for i in tree.feature_importances_]).reset_index()

sns.set(rc={"figure.figsize": (10,10)})
sns.set_style('white')
features_df.rename(columns = {'index':'score',0:'features'},inplace = True)

features_df.set_index('features').score.sort_values().plot(kind = 'barh')

# 'year','bed','weapon laws',
(13143, 41) (13143,)
[ 1696.94495413  1149.96159122  2052.98426745 ...,  1480.02469136
  4054.64285714  2281.22155689]
performance r2: 0.580719889268
[  2.04996391e-05   2.25524899e-02   5.72851345e-06   1.60984497e-04
   4.84608756e-03   2.16567898e-04   6.04747672e-04   1.81078571e-02
   3.82280014e-04   1.11331887e-04   1.63530988e-04   3.15323147e-04
   1.32751768e-04   7.46055853e-04   0.00000000e+00   0.00000000e+00
   9.44086129e-02   1.11448899e-03   6.02381927e-05   3.89807586e-02
   5.17583784e-03   8.66237166e-03   0.00000000e+00   1.79853255e-04
   9.09357705e-04   2.33767588e-03   7.75982631e-04   5.13641614e-03
   8.01989893e-04   7.69576040e-05   3.35363190e-04   0.00000000e+00
   5.15675458e-03   2.66605829e-03   2.93904668e-03   8.24517110e-03
   4.71566463e-03   8.25894119e-05   4.00477619e-01   2.99707413e-01
   6.86875422e-02]
Out[151]:
<matplotlib.axes._subplots.AxesSubplot at 0x127217dd0>
In [58]:
sns.set(rc={"figure.figsize": (10, 10)})
sns.set_style('white')
sns.regplot(y_test,tree_preds)
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x124712590>
In [152]:
selected_features = [i for i in features_df[features_df.score > .01].features]

# best fit tree regressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(tmp[D1][selected_features], target, test_size=0.33)

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

tree = DecisionTreeRegressor(criterion='mse', max_depth = 9, min_samples_split = 5)

tree.fit(X_train, y_train)

tree_preds = tree.predict(X_test)
print X_test.shape,y_test.shape

print target[0:3]
tree_performance = tree.score(X_test, y_test)

print tree_preds
print 'performance r2:',tree_performance
print tree.feature_importances_

#eatures_df = pd.DataFrame([col for col in tmp[D1]],[i for i in tree.feature_importances_]).reset_index()
(13143, 7) (13143,)
0     950.0
1    1500.0
2    1795.0
Name: price, dtype: float64
[ 3785.69117647  4729.39002268  1802.14620939 ...,  1443.95833333
  1627.39379475  2592.77777778]
performance r2: 0.600801665065
[ 0.03932828  0.02173101  0.10407693  0.04684223  0.40231574  0.30341123
  0.08229458]

Random Forest

In [153]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score

forest = RandomForestRegressor(random_state=0, n_estimators=1000, min_samples_split = 5)



X_train, X_test, y_train, y_test = train_test_split(tmp[D1], target, test_size=0.33)

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

forest.fit(X_train, y_train)

forest_preds = forest.predict(X_test)

# forest_preds = cross_val_score(estimator = forest, X_test, y= y_test, scoring=None, cv=5, n_jobs=-1, verbose=0, fit_params=None)

forest_performance = forest.score(X_test, y_test)

print forest_preds
print 'forest r2:',forest_performance
print forest
print forest.get_params
[ 2798.00801042  1455.01134417  2529.72549368 ...,  2506.23580086
  1360.07531667   977.14651236]
forest r2: 0.57764370193
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=5, min_weight_fraction_leaf=0.0,
           n_estimators=1000, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)
<bound method RandomForestRegressor.get_params of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=5, min_weight_fraction_leaf=0.0,
           n_estimators=1000, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)>
In [63]:
print 'performance', forest_performance
sns.set(rc={"figure.figsize": (10, 10)})
sns.set_style('white')
sns.regplot(y_test,forest_preds)
performance 0.574853421998
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x156f6a4d0>
In [64]:
# random forest selected features

# random forest selected features
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score

forest = RandomForestRegressor(random_state=0, n_estimators=1000, min_samples_split = 5)



X_train, X_test, y_train, y_test = train_test_split(tmp[D1][selected_features], target, test_size=0.33)

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

forest.fit(X_train, y_train)

forest_preds = forest.predict(X_test)

# forest_preds = cross_val_score(estimator = forest, X_test, y= y_test, scoring=None, cv=5, n_jobs=-1, verbose=0, fit_params=None)

forest_performance = forest.score(X_test, y_test)

print forest_preds
print 'forest r2:',forest_performance
print forest
print forest.get_params
[ 2634.46229774  2274.59509883  1991.6478776  ...,  3963.92108756
  9349.0070456   1616.4868673 ]
forest r2: 0.620942969865
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=5, min_weight_fraction_leaf=0.0,
           n_estimators=1000, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)
<bound method RandomForestRegressor.get_params of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=5, min_weight_fraction_leaf=0.0,
           n_estimators=1000, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)>
In [72]:
def prediction_plot(pred_array, title, x_limit, actual):
    fig = plt.figure(figsize=(12,10))
    
    frame1=fig.add_axes((.1,.3,.8,.6))
    
    mean_ = np.mean(pred_array)
    min_ = np.min(pred_array)
    max_ = np.max(pred_array)
    ax = fig.gca()
    
    ax.plot(range(len(pred_array)), pred_array, 
               alpha=0.5, c='darkred', label='predicted Rental Price')
    
    ax.scatter(range(len(actual)), actual, s=35,
               alpha=0.7, c='blue', label='actual Rental Price')
    
    zero = ax.axhline(0, lw=1.5, c='black')
    
    mean_line = ax.axhline(mean_, ls='-', lw=1., c='black', alpha=0.5,
                     label='Mean predicted val: %0.2f'%mean_)
              
    max_line = ax.axhline(max_, ls='--', lw=1., c='blue',alpha=0.7, 
               label='Max predicted val: %0.2f'%max_)
    min_line = ax.axhline(min_, ls='--', lw=1., c='green',alpha=0.7,
              label='Min predicted val: %0.2f'%min_)
    
    ax.legend(fontsize=16, bbox_to_anchor=(1.5,1.))
    ax.set_xlim([0,x_limit])
    ax.set_ylim([100,7000])
    ax.set_ylabel('Rental Price', fontsize=16)
    ax.set_title(title, fontsize=17)
    
    #Residual plot box
    residual_error = actual - pred_array
    residual_error = residual_error.tolist()
    
    frame2=fig.add_axes((.1,.1,.8,.2))
    frame2.scatter(range(len(residual_error)), residual_error, c='darkred', label='Residual errors')
    
    
    for i in range(len(residual_error)):
        frame2.plot((i,i),(0,residual_error[i]), ls='--', c='green')
    
    frame2.axhline(0, lw=1.5, c='black', ls='--')
    frame2.set_ylim([-500,500])
    frame2.set_xlim([0,x_limit])
    frame2.legend(bbox_to_anchor=(1.25,1.0), fontsize=15)
    plt.show()
    
    
def plot_list(plot_funtion ,lists, titles, x_limit, actual):
    for i,a in enumerate(lists):
        plot_funtion(a,titles[i], x_limit, actual)
        
plot_list(prediction_plot,[forest_preds], ['Random Forest'], 50, y_test)

Random forest for each district and scores

In [156]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
features = {}

tmp = posts_crimes[[col for col in posts_crimes if posts_crimes[col].dtype != object or col == 'district1']]
D1 = [c for c in tmp if 'D1' in c ]+['bed','year','district1','price']
group_forest_df = tmp[D1]

for district, group_df in tmp[D1].groupby("district1"):
    try: 
        
        target = group_df.price
        predictors = group_df[[col for col in group_df if col not in ['price','district1']]]


        X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.33)


        y_train = np.ravel(y_train)
        y_test = np.ravel(y_test)

        model = RandomForestRegressor()

        forest = model.fit(X_train, y_train)
        forest_performance = model.score(X_test, y_test)
        print district, forest_performance
        features[district] = forest.feature_importances_
        
    except:
        continue
0 0.4534026884
bernal heights 0.669536523388
castro -0.432123693863
chinatown 0.0
downtown 0.525188033295
excelsior -2.0503125
financial district 0.180485130237
glen park 0.762569529916
golden gate park 0.0038805721573
haight ashbury 0.809309542235
hayes valley 0.667443686363
ingleside 0.615445378753
inner richmond 0.713054895864
lakeshore -32.935475
lone mountain 0.505649765158
lower haight 0.818314085141
marina 0.753731939664
merced 0.116390353074
mission bay 0.819564733915
mission district 0.692517015357
nob hill 0.527521115059
noe valley 0.756407496379
north beach 0.511068043108
oceanview -0.0545609254282
outer mission 0.719535573327
outer richmond -1.28764172336
pacific heights 0.696433784448
panhandle -2.17870662571
parkside 0.201585818162
portola 0.84213957272
potrero hill 0.728205741709
presidio 0.468807806929
presidio heights 0.742877588801
russian hill 0.379193824493
seacliff 0.651312352943
south beach 0.35067550601
south of market 0.258286039219
sunset 0.630659937749
tenderloin -0.0527941949927
treasure island -3.49360795048
twin peaks 0.480643935319
upper market 0.62693356218
usf 0.800772275956
visitacion valley 0.261279243516
western addition 0.630770298188
In [127]:
features_df = pd.DataFrame(features).T
In [128]:
features_df.columns = [i for i in predictors.columns]
features_df = features_df.T
In [129]:
features_df.head()
Out[129]:
0 bernal heights castro chinatown downtown excelsior financial district glen park golden gate park haight ashbury ... south beach south of market sunset tenderloin treasure island twin peaks upper market usf visitacion valley western addition
D1_ARSON 0.0 0.001166 0.000000 0.0 0.0 0.0 0.000195 0.000000 0.000000 0.000001 ... 0.001021 0.000000 0.000544 0.001373 0.000000 0.000023 0.0 0.001198 0.001107 0.000000
D1_ASSAULT 0.0 0.001177 0.000000 0.0 0.0 0.0 0.037276 0.002759 0.000000 0.003895 ... 0.001894 0.000138 0.007426 0.028113 0.000001 0.013675 0.0 0.000665 0.001693 0.005648
D1_BAD CHECKS 0.0 0.000052 0.000000 0.0 0.0 0.0 0.000010 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000488 0.000000 0.000000 0.000162 0.0 0.000000 0.002878 0.000000
D1_BRIBERY 0.0 0.000569 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.076303 0.000000 ... 0.000017 0.000000 0.000215 0.000022 0.000000 0.000000 0.0 0.000072 0.001388 0.000185
D1_BURGLARY 0.0 0.009469 0.080556 0.1 0.0 0.0 0.022503 0.005642 0.002413 0.004071 ... 0.000926 0.015208 0.008084 0.012495 0.103052 0.003624 0.0 0.017258 0.008406 0.022088

5 rows × 45 columns

Deeper Analysis:

Saw price differences and theft differences in different neighborhoods, thought something more complex was happening. See how different neighborhoods react according to the model . Saw different feature importances meaning that something interesting is going on. SOMA has a very high feature importance of Larceny and Theft while no other neighborhood does. In addition while running my model of random forest for each individual district I saw a huge variance in the score for each district, with some at .8 like USF while another one had -.32 like Lakeshore which means the distribution of counts of each district widely varies.

In [130]:
for index,i in enumerate(['mission district','south of market','marina','russian hill','north beach','tenderloin']):
    plt.figure(index)
    features_df_sub = features_df[i].sort_values()
    sns.set_style('white')
    features_df_sub.plot(kind = 'barh',title = i +' feature_importances')
    print i
mission district
south of market
marina
russian hill
north beach
tenderloin

Mapping:

In [86]:
map_choro = pd.read_csv('/Users/Stav/Downloads/new_geoms_2.csv')

from branca.colormap import linear

colormap = linear.YlOrRd.scale(
    1000,3500)
    #map_choro.price.max())

price_dict = {}
for k,v in zip(map_choro['district1'],map_choro['price']):
    price_dict[k] = v
    
color_dict = {key: colormap(price_dict[key]) for key in price_dict.keys()}
color_dict
Out[86]:
{'bayview': '#fec863',
 'bernal heights': '#fea747',
 'castro/upper market': '#feb54f',
 'chinatown': '#ffe98e',
 'diamond heights': '#e9251f',
 'downtown/civic center': '#feb650',
 'excelsior': '#fff6a4',
 'financial district': '#b20026',
 'glen park': '#fe9640',
 'golden gate park': '#feae4a',
 'haight ashbury': '#feb852',
 'inner richmond': '#fec15b',
 'inner sunset': '#b10026',
 'lakeshore': '#fe923e',
 'marina': '#fe9a42',
 'mission district': '#fea847',
 'nob hill': '#feb54f',
 'noe valley': '#fe943f',
 'north beach': '#fd853a',
 'ocean view': '#fea345',
 'outer mission': '#fec35e',
 'outer richmond': '#fed673',
 'outer sunset': '#febf59',
 'pacific heights': '#fd582d',
 'parkside': '#cf0f20',
 'potrero hill': '#fd6831',
 'presidio': '#fd7635',
 'presidio heights': '#f13723',
 'russian hill': '#fd7e37',
 'seacliff': '#fec35e',
 'south of market': '#b10026',
 'twin peaks': '#fea245',
 'visitacion valley': '#fd7f38',
 'west of twin peaks': '#febf59',
 'western addition': '#ffdb79'}

Conclusions:

Different feature importances show that neighborhoods attributes such as crime counts of different types affect pricing differently. From my analysis we can say that crime is not directly related to pricing but yet pricing might affect which types of crimes take place where.

In [89]:
from IPython.display import HTML
HTML('<iframe width="100%" height="520" frameborder="0" src="https://stavgrossfeld.cartodb.com/viz/f411c652-3cdf-11e6-8073-0ecfd53eb7d3/embed_map" \
     allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen></iframe>')
from folium.map import Popup, Icon, Marker
import folium
from folium import plugins

crime_sub_map = crime_sub
crime_sub_map['Date'] = crime_sub.index

#Cartodb Dark_matter
m = folium.Map(location=[37.7599,-122.431297],tiles='Cartodb Dark_matter', zoom_start = 12, min_zoom = 12, max_zoom = 15)

folium.GeoJson(open('/Users/Stav/Downloads/new_geoms_2.geojson'),style_function=lambda feature:{'fillColor': color_dict[feature['properties']['district1']],'fillOpacity':'.5','color' : 'white','weight':'3','dashArray':'5,5'}).add_to(m)

marker_cluster = folium.MarkerCluster("Theft").add_to(m)

for each in crime_sub_map[crime_sub_map['is_theft']==1][0:1000].iterrows():
    #print each[1]['Category'], each[1]['DayOfWeek'], each['Time']
    popup = str('crime: ' + each[1]['Category'] + '\n' + 'date: ' + str(each[1]['Date']).split()[0] + '\n'  'day: '  + each[1]['DayOfWeek'] + '\n' + 'time: ' + each[1]['Time'])
    folium.Marker(location = [each[1]['Y'],each[1]['X']],popup=popup).add_to(marker_cluster)
    
    
    
    

print colormap
m
# display(HTML("""
# <div style="position: absolute">
#     <div id='legend-form'>
#         <h4>rental price</h4>
#            <select name="something">
#                <option>price</option>
#                <option>crime</option>
#            </select>
#     </div>
# </div>
# """))
<branca.colormap.LinearColormap object at 0x1246fc2d0>
Out[89]:
In [91]:
colormap
Out[91]:
10003500
In [157]:
posts_crimes.head()
Out[157]:
D1_ARSON D1_ASSAULT D1_BAD CHECKS D1_BRIBERY D1_BURGLARY D1_DISORDERLY CONDUCT D1_DRIVING UNDER THE INFLUENCE D1_DRUG/NARCOTIC D1_DRUNKENNESS D1_EMBEZZLEMENT ... bed district1 district2 district3 post price year district1_code district2_code district3_code
0 0 9 0 0 4 1 0 1 0 1 ... 1 sunset parkside inner sunset 950 1br sunny one bedroom for rent sunset p... 950.0 2003 45 36 14
1 0 7 0 0 5 0 1 1 2 0 ... 2 marina nob hill nob hill 1500 2br 1 ba cow hollow open house thursday... 1500.0 2003 21 27 27
2 1 37 1 0 10 4 1 34 1 2 ... 2 hayes valley noe valley visitacion valley 1795 2br spacious completely remodeled 2 bd... 1795.0 2003 11 28 51
3 1 16 0 0 12 2 0 3 1 0 ... 0 pacific heights presidio heights inner richmond 1180 0br victorian hotel room wprivate bath ... 1180.0 2003 33 40 13
4 1 12 0 0 6 0 0 4 1 0 ... 1 inner richmond outer richmond inner sunset 1100 1br great landlordgreat tenats inner ri... 1100.0 2003 13 32 14

5 rows × 124 columns

In [116]:
from folium.map import Popup, Icon, Marker
import folium
from folium import plugins

from IPython.display import clear_output
from ipywidgets import *
from IPython.display import display



hello = [i for i in crime_sub.Category.unique()]


from ipywidgets import Dropdown


crime = Dropdown(
   options= hello,
   value = hello[0],
   description='Crimes: ',
)



submit_button = Button(description="Submit Info")
submit_button.on_click(test)  # test is my mapping function

display(crime)
display(submit_button)

def test(crime_sub):
    clear_output()
    print crime.value
    crime_select = crime_sub_map[crime_sub_map.Category == crime.value]
    
    select_map = folium.Map(location=[37.7599,-122.431297],tiles='Cartodb Dark_matter', zoom_start = 12, min_zoom = 12, max_zoom = 15)

    folium.GeoJson(open('/Users/Stav/Downloads/new_geoms_2.geojson'),style_function=lambda feature:{'fillColor': color_dict[feature['properties']['district1']],'fillOpacity':'.5','color' : 'white','weight':'3','dashArray':'5,5'}).add_to(select_map)

    marker_cluster = folium.MarkerCluster("select").add_to(select_map)

    for each in crime_select[0:1000].iterrows():
        #print each[1]['Category'], each[1]['DayOfWeek'], each['Time']
        popup = str('crime: ' + each[1]['Category'] + '\n' + 'date: ' + str(each[1]['Date']).split()[0] + '\n'  'day: '  + each[1]['DayOfWeek'] + '\n' + 'time: ' + each[1]['Time'])
        folium.Marker(location = [each[1]['Y'],each[1]['X']],popup=popup).add_to(marker_cluster)
    clear_output()
    display(select_map)
    display(colormap)


    clear_output
# after they submit destination.values
10003500
In [ ]:
 
In [ ]:
 
In [ ]: